Abstract. Community detection is a fundamental problem in the domain of complex-network
analysis. It has received great attention, and many community detection
methods have been proposed in the last decade. In this paper, we propose a
divisive spectral method for identifying community structures from networks, which
first utilizes a sparsification operation to pre-process the networks, and then uses
a repeated bisection spectral algorithm to partition the networks into communities.
The sparsification operation makes the community boundaries clearer and
sharper, so that the repeated spectral bisection algorithm can extract high-quality
community structures accurately from the sparsified networks. Experiments show that
the combination of network sparsification and the spectral bisection algorithm is highly
successful: the proposed method is more effective in detecting community structures
from networks than the compared methods.


1. Introduction 

Many systems can be modeled as complex networks, in which vertices represent
individuals and edges describe connections between them. A significant characteristic
of many networks is the so-called “community structure”, the tendency of
vertices to be partitioned naturally into groups, with denser connections between
vertices within groups and sparser connections across groups [1, 2]. The communities can be
groups of Web pages sharing the same topics in WWW networks [3, 4], or pathways in
metabolic networks, or complexes in protein-protein interaction networks [5, 6, 7, 8, 9].

Identifying the community structures from networks is very important, because
such structures can have significant influences on the function of networks. Therefore,
there has been considerable research on the problem of community detection, and
a large number of methods have been developed and applied to various networks. In
this paper, we focus on spectral methods, especially on bisection spectral methods,
for their sound theoretical principles. The spectral methods originated as a
kind of graph-partitioning method [10, 11, 12, 13, 14], and then developed into
classical methods for clustering [15, 16, 17, 18] and for community detection
[19, 20, 21, 22, 9, 23, 24, 25, 26, 27, 28, 29] in the fields of data mining and complex
network analysis, respectively.

For community detection, the spectral methods utilize the eigenspectra of various
types of network-associated matrices to identify the community structure. For instance,
by analyzing the spectrum of the network Laplacian matrix, Donetti et al. [19]
projected the network vertices into a tunable-dimensionality eigenvector space and
selected the community structure corresponding to the global maximum of modularity [2] over
all possible dimensions of the eigenvector space. Arenas et al. [20]
reported the existence of a connection between the spectral information of the Laplacian 
matrix and the hierarchical process of emergence of communities at different time scales, 
which can be utilized to extract community structures from networks. Based on the 
normalized Laplacian matrix and its eigenvalues, Chen et al. [21] demonstrated that the 
stable local equilibrium states of the diffusion process can reveal the inherent community 
structures of networks, which can be extracted through optimizing the conductance of 
networks directly. Newman [22] discussed the equivalence between community detection 
and the normalized-cut graph partitioning, and gave spectral algorithms based on the 
normalized Laplacian matrix of networks to solve the two types of problems. Lange et al.
[9] examined the spectra of the normalized Laplacian matrices of the macroscopic anatomical
neural networks of the macaque and cat, and of the microscopic network of C. elegans,
and revealed an integrative community structure in these neural networks.

In addition to the Laplacian matrix and the normalized Laplacian matrix, the
eigenspectra of other types of network-associated matrices have been used to extract community
structures as well. For example, Chauhan et al. [23] found that the spectrum of the
network adjacency matrix has some eigenvalues that are significantly larger in
magnitude than the rest, and that the number of such eigenvalues indicates the number of
communities in the network. Newman [24] first divided the network vertices into two groups
according to the signs of the elements of the leading eigenvector of the “modularity matrix” and then
subdivided those groups recursively based on the “generalized modularity matrix”. Shen
et al. [25] used the network covariance matrix to uncover the multiscale community
structure, and defined a “correlation matrix” whose eigenvectors are used to extract the
multiscale community structure from heterogeneous networks. In Ref. [26],
Shen et al. found that the normalized Laplacian matrix and the correlation matrix
outperform the other three types of aforementioned matrices in detecting community
structures from networks. To overcome the resolution limit problem of modularity,
Nascimento [27] constructed a new network based on the leading eigenvectors of the
“clustering coefficient matrices” calculated for every vertex to extract the final community
structure. Capocci et al. [28] utilized the first few eigenvectors of the network transition
matrix to calculate the correlations between vertices to determine whether they belong to
the same community. Gennip et al. [29] exploited a standard spectral clustering
algorithm based on the transition matrix to identify social communities among gang
members in the Hollenbeck policing district in Los Angeles.

Among all these spectral methods, the bisection spectral methods are a special
case. They divide the network into two parts using information from a certain
eigenvector, such as the median value of the components of the eigenvector corresponding to
the smallest non-zero eigenvalue of the Laplacian matrix for graph partitioning [12, 14],
the signs of the components of the leading eigenvector of the (generalized) modularity
matrix [24], or the signs of the elements of the eigenvector corresponding to the second
largest eigenvalue of the normalized Laplacian matrix [22] for community detection. All
of these works derive their methods mathematically. Benefiting
from these solid mathematical foundations, the results acquired by the bisection spectral
methods are more interpretable, more credible and more persuasive than those based
only on experience or on empirical studies.

When used in applications of traditional graph partitioning, such as VLSI circuit
design, load balancing or communication reduction in parallel computing, the
bisection spectral methods tend to partition the network into equal-sized subgraphs.
For community detection, we need to obtain a community structure as natural as
possible. For a two-community network, the bisection spectral methods can partition
it into two parts corresponding to the community structure successfully [22]. However,
in general cases, the network contains more than two communities. For those networks,
a natural idea is to bisect the two subgraphs recursively after the first division, but
this is not guaranteed to yield the most natural community structure. That is the
reason why Newman [24] subdivided the subgraphs based on the generalized modularity
matrix after the first division rather than bisecting recursively based on the leading
eigenvector of the modularity matrix only. Even so, the result is not ideal, so Newman had
to employ a vertex-moving strategy to fine-tune the communities after each division. The
communities extracted by this method are, by definition, indivisible subgraphs, which
are, in many networks, too trivial to be acceptable, and the extracted community
structure often deviates far from the ground truth. For this reason, Newman [22] pointed
out that how to generalize the bisection spectral methods to networks containing more
than two communities is still an open problem.

In this paper, we propose a method to solve the problem. We observed that,
for networks with apparent community structure, in which communities are
separated clearly and sharply, the recursive bisection spectral method can reliably extract a
high-quality community structure. Inspired by this observation, we propose a
network-sparsification algorithm that promotes the prominence of the community structure
by removing some edges from the network, followed by a repeated
bisection algorithm that extracts the community structure from the sparsified network.

The remainder of this paper is organized as follows. In section 2, we demonstrate
the observation mentioned above using an example network with apparent community
structure. The proposed method is elucidated in section 3, the experimental results are
shown in section 4, and the paper ends with a conclusion in section 5.

2. Observation 

Although the recursive bisection spectral method is not guaranteed to obtain the best
community structures in general cases, we have observed that it does work well on
some special networks. For example, the simple network illustrated in Figure 1(a) is
such a special network: it contains 3 communities, and the community boundaries are
evident. Applying the recursive bisection spectral method to this network yields the
ideal result. Figure 1(b) shows the result of the first bisection. Recursively bisecting the
two subgraphs in Figure 1(b), we obtain the resulting community structure presented
in Figure 1(c), which is identical to the ground-truth community structure.

Figure 1. A simple network containing more than two communities. 

(a) The ground-truth community structure; (b) The community structure 
corresponding to the first bisection; (c) The community structure corresponding 
to the second bisection. The different vertex shapes and shades indicate 
different communities, the black lines represent edges within communities, and 
the light gray lines represent connections across communities. This illustration 
style also applies to the next figures. 

In fact, we have also tested the recursive bisection spectral method on some other
networks that have characteristics similar to those of the one illustrated in Figure 1(a), and we
observed that all results are satisfactory. That is to say, applying the recursive bisection
spectral method to networks in which communities are well defined and separated
clearly and sharply can extract high-quality community structures.

Inspired by this observation, we propose a method in this paper to extend the
recursive bisection spectral method to a repeated bisection spectral method that can deal
with networks that contain more than two communities and whose community boundaries
are not so sharp. We first remove some edges from the network to make the community
boundaries clearer and sharper, then use the repeated bisection spectral
method to extract the final community structure.

3. Method

The proposed method comprises two algorithms. The first one is responsible
for sparsifying the network to promote the prominence of the community structure by
removing some edges from the network; the second one is the repeated bisection spectral
algorithm, which extracts the community structure accurately from the sparsified network.

To facilitate the description of the proposed method, some notations are given in
definition form as follows.


Definition 1. A network is an unweighted and undirected simple graph $G = (V, E)$,
where $V$ and $E$ are the vertex set and the edge set, respectively, and $|V| = n$, $|E| = m$.

Definition 2. A community structure of network $G$ is a partition of the network,
denoted as $CS = \{C_1, C_2, \cdots, C_K\}$, where $C_i \subseteq V$, $\bigcup_{i=1}^{K} C_i = V$ and
$C_i \cap C_j = \emptyset$ ($i, j = 1, 2, \cdots, K$, and $i \neq j$), and $K$ is the number of
communities in the partition. In accordance with the concept of community, an additional constraint,

$$\sum_{i=1}^{K} \left| \{(u,v) \mid (u,v) \in E, u \in C_i, v \in C_i\} \right| \gg \left| \{(u,v) \mid (u,v) \in E, u \in C_i, v \in C_j, i \neq j\} \right|,$$

is always attached to the partition, which means that the edges within
communities are much denser than those across different communities.

Definition 3. $N(v)$ is the neighbour set of vertex $v$, i.e., $N(v) = \{u \mid (v,u) \in E\}$.

Definition 4. $d_v$ is the degree of vertex $v$, i.e., the number of edges incident to vertex
$v$: $d_v = |N(v)|$.

Definition 5. The similarity between a pair of vertices, $u$ and $v$, is denoted as $Sim(u, v)$.


3.1. Network sparsification 

The objective of the network-sparsification algorithm is to make the community boundaries
clearer and sharper by removing some edges from the network. But which
edges should be removed to reach this goal? The best answer is certainly the edges across
communities, but that is an idealized scenario: we cannot readily
determine which edges run across communities, for if we could, the community structure
could be extracted easily.

However, according to the concept of community, edges within communities are
much denser than those across communities, which means that every vertex and most of its
neighbours should belong to the same community. Therefore, if we use a neighbour-related
measure to calculate the similarity, $Sim(u,v)$, between any pair of vertices, $u$
and $v$, connected by an edge, the similarities between vertices in the same community
will generally be much larger than those between vertices
located in different communities.

Based on this idea, we employ a very simple strategy to sparsify the network. First,
we define the similarity between any pair of vertices, $u$ and $v$, as follows,

$$Sim(u,v) = \begin{cases} \dfrac{|N(u) \cap N(v)|}{d_u} & \text{if } (u,v) \in E, \\ 0 & \text{otherwise.} \end{cases}$$

Obviously, $Sim(u,v) \neq Sim(v,u)$ in general cases, i.e., this similarity is asymmetric.

Then, we remove from the network every edge whose end vertices have similarities
smaller than a given threshold, $\theta$. Edges for which
the degree of either end vertex is not larger than 3 are given special consideration.
The entire procedure is listed as Algorithm 1.


Algorithm 1: The network-sparsification algorithm
Input: G(V, E), network; θ, similarity threshold
Output: G, sparsified network

foreach (u, v) ∈ E do
    d_min ← min{d_u, d_v}
    x ← argmin_w {d_w | w ∈ {u, v}}
    if d_min ≤ 2 then
        continue
    if d_min = 3 then
        if max{d_w | w ∈ N(x)} ≤ d_min then
            continue
    if (Sim(u, v) < θ) and (Sim(v, u) < θ) then
        G.remove_edge(u, v)
return G


The operations are almost self-explanatory. For each edge in the network, if the
degree of either end vertex is not larger than 2, we bypass this edge directly. For
an edge whose lower-degree end vertex has degree 3, we check whether
any neighbour of that end vertex has a degree larger than 3.
If not, this edge is also skipped. The aim of these special considerations is to keep
the network from being partitioned into trivial or even single-vertex communities in the
network-sparsification stage. For each of the other edges in the network, we calculate the two
asymmetric similarities between its end vertices; if both values
are smaller than the given threshold, $\theta$, we remove this edge from the network.
Finally, the sparsified network is returned.
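
The procedure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the graph is assumed to be a dictionary mapping each vertex to the set of its neighbours, vertex labels are assumed to be comparable, and the function names `sim` and `sparsify` are hypothetical. The listing does not say whether degrees and similarities are re-evaluated after each removal; the sketch assumes they are always taken from the original graph, so the result does not depend on the edge order.

```python
def sim(adj, u, v):
    """Asymmetric similarity: common neighbours of u and v, divided by deg(u)."""
    if v not in adj[u]:
        return 0.0
    return len(adj[u] & adj[v]) / len(adj[u])


def sparsify(adj, theta):
    """Return a sparsified copy of an undirected simple graph {vertex: set(neighbours)}.

    Assumption: degrees and similarities are read from the original graph `adj`,
    while edges are removed from the copy `out`.
    """
    out = {v: set(nb) for v, nb in adj.items()}
    edges = [(u, v) for u in adj for v in adj[u] if u < v]  # each edge once
    for u, v in edges:
        x = u if len(adj[u]) <= len(adj[v]) else v  # end vertex of minimum degree
        d_min = len(adj[x])
        if d_min <= 2:
            continue  # never cut edges at vertices of degree 1 or 2
        if d_min == 3 and max(len(adj[w]) for w in adj[x]) <= d_min:
            continue  # x has no neighbour of larger degree
        if sim(adj, u, v) < theta and sim(adj, v, u) < theta:
            out[u].discard(v)
            out[v].discard(u)
    return out
```

On two 4-vertex cliques joined by a single edge, the joining edge has no common neighbours, so both similarities are 0 and it is the only edge removed at $\theta = 0.15$.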


3.2. Repeated bisection spectral algorithm 


After sparsification, we extract the community structure from the sparsified network
using our proposed repeated bisection spectral algorithm, which
is based on the signs of the elements of the eigenvector corresponding
to the second largest eigenvalue of the network transition matrix.

In Ref. [22], starting from optimizing modularity, Newman derived the formulas
that describe the rationale of his bisection spectral method (although it is fit for
two-community networks only), and obtained a formula for the modularity

$$Q = \frac{\lambda}{2},$$

where $\lambda$ is the eigenvalue of the generalized eigenvector equation

$$A\mathbf{s} = \lambda D\mathbf{s} \qquad (1)$$

with the constraint $\mathbf{k}^{T}\mathbf{s} = 0$. Here $A$ is the network adjacency matrix, $D$ is a diagonal
matrix with elements equal to the vertex degrees, i.e., $D_{ii} = d_i$, $\mathbf{k}$ is a vector with
elements $k_i = d_i$, and the eigenvector $\mathbf{s}$ is the solution vector whose elements equal $\pm 1$,
i.e., $s_i = +1$ indicates that vertex $i$ is put into group 1, and $s_i = -1$ into group 2.

Our proposed repeated bisection spectral algorithm is also based on the above
formulas. Let us first consider the scenario of bisecting a network into two parts, which is then
applied repeatedly. To maximize the value of the modularity $Q$, we should choose $\lambda$ to
be the largest eigenvalue of Eq. (1), but this is impossible here: it is obvious that the
vector $\mathbf{s} = \mathbf{1} = (1,1,\cdots,1)^{T}$ is an eigenvector of Eq. (1), and according to the
Perron-Frobenius theorem, it must correspond to the largest (most positive) eigenvalue, but
$\mathbf{s} = \mathbf{1}$ fails to satisfy the constraint $\mathbf{k}^{T}\mathbf{s} = 0$. Therefore, the best we can
do is to choose $\lambda$ to be the second largest eigenvalue to maximize the modularity, $Q$,
and to choose the corresponding eigenvector to be the solution vector $\mathbf{s}$. However, this
eigenvector is a real-valued vector. Considering the constraint $s_i = \pm 1$, we
simply round the value of each $s_i$ to $\pm 1$ to get the solution vector. This operation is equivalent
to checking the signs of the elements of $\mathbf{s}$ to put the corresponding vertices into group 1
or into group 2. Hereafter in this paper, we call “the eigenvector corresponding to the
second largest eigenvalue” “the second eigenvector” to facilitate the description.
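
The claim about the all-ones vector can be checked in two lines. The following is a short worked derivation, using only the quantities defined for Eq. (1) (adjacency matrix $A$, degree matrix $D$, degree vector $\mathbf{k}$, eigenvalue $\lambda$) together with the edge count $m$ from Definition 1:

```latex
\begin{align*}
(A\mathbf{1})_i &= \sum_j A_{ij} = d_i = (D\mathbf{1})_i
  &&\Rightarrow\quad A\mathbf{1} = 1 \cdot D\mathbf{1}, \text{ i.e. } \lambda = 1, \\
\mathbf{k}^{T}\mathbf{1} &= \sum_i d_i = 2m \neq 0
  &&\Rightarrow\quad \text{the constraint } \mathbf{k}^{T}\mathbf{s} = 0 \text{ fails for } \mathbf{s} = \mathbf{1}.
\end{align*}
```

So the all-ones vector is always a solution of Eq. (1) with $\lambda = 1$, but it can never be the solution vector for any network with at least one edge.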

To solve Eq. (1), we simply rearrange its terms and obtain†

$$D^{-1}A\mathbf{s} = \lambda\mathbf{s}. \qquad (2)$$

The matrix

$$T = D^{-1}A \qquad (3)$$

is the transition matrix corresponding to a random walk in the network, and our proposed
algorithm is based on it: for the sparsified network, we compute the second eigenvector
of the transition matrix, and then divide the vertices of the sparsified network into
two communities according to the signs of the elements of the second eigenvector. This
bisection operation divides the network vertices into two communities only.
To extract a community structure containing multiple communities, we
construct a subnetwork for each community and, from all subnetworks, select the one whose
split leads to a new community structure with the maximal modularity;
only that bisection is actually performed. This division operation is repeated until the
number of communities reaches the given number, $K$.

The pseudo code outlining the entire procedure is listed in Algorithm 2. After
sparsification, the network itself might become disconnected. We take each connected
component as a community, and all of the connected components comprise the initial
community structure $CS$. Next, for each community $C_i \in CS$, a subnetwork of $G$,
$sg_i$, is constructed and then bisected into two subgraphs by calling the function
“spectra_bisection()”, and we calculate the modularity of the new community structure
corresponding to this bisection. From all bisections, the one with the maximal modularity
(the corresponding community is $C_j$) is accepted as the real division
by removing $C_j$ from $CS$ and inserting the two obtained communities $C_{j1}$ and $C_{j2}$ into $CS$.
This operation is repeated until the number of communities reaches $K$, and we finally obtain
the resulting community structure.

† The matrix $D$ is invertible because all subnetworks involved are connected.
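
The selection loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the graph is a dictionary mapping each vertex to its neighbour set, the bisection routine is passed in as a hypothetical `bisect` argument (any function that splits a subnetwork into two vertex sets), modularity is computed on the graph that is passed in, and communities with fewer than two vertices or with a degenerate split are simply skipped.

```python
def connected_components(adj):
    """Return the connected components of an undirected graph as a list of sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            for w in adj[stack.pop()]:
                if w not in component:
                    component.add(w)
                    stack.append(w)
        seen |= component
        components.append(component)
    return components


def modularity(adj, communities):
    """Modularity of a partition: sum over communities of e_ii - a_i^2."""
    two_m = sum(len(nb) for nb in adj.values())
    q = 0.0
    for c in communities:
        inside = sum(1 for u in c for w in adj[u] if w in c)  # 2 x internal edges
        degree_sum = sum(len(adj[u]) for u in c)
        q += inside / two_m - (degree_sum / two_m) ** 2
    return q


def repeated_bisection(adj, k, bisect):
    """Split communities one at a time until there are k of them."""
    cs = connected_components(adj)  # initial community structure
    while len(cs) < k:
        best_q, best_cs = None, None
        for i, c in enumerate(cs):
            if len(c) < 2:
                continue  # a single vertex cannot be bisected
            sub = {v: adj[v] & c for v in c}  # induced subnetwork
            c1, c2 = bisect(sub)
            if not c1 or not c2:
                continue  # degenerate split
            candidate = cs[:i] + cs[i + 1:] + [set(c1), set(c2)]
            q = modularity(adj, candidate)
            if best_q is None or q > best_q:
                best_q, best_cs = q, candidate
        if best_cs is None:
            break  # nothing left to split
        cs = best_cs  # accept only the best bisection
    return cs
```

Passing the bisection routine as an argument keeps the loop independent of how the second eigenvector is computed. Note that this naive version bisects every community in every iteration, which is exactly the duplication that Section 3.3 sets out to avoid.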


Algorithm 2: The repeated bisection spectral community detection algorithm
Input: G(V, E), network; K, number of communities in the resulting community structure
Output: CS, community structure

CS ← G.connected_components()
while |CS| < K do
    foreach C_i ∈ CS do
        sg_i ← G.subgraph(C_i)
        (C_i1, C_i2) ← spectra_bisection(sg_i)
        calculate modularity, denoted as Q_i, of the community structure supposing that
            C_i is removed from CS, and C_i1 and C_i2 are inserted into CS
    j ← argmax_i {Q_i | i = 1, 2, ..., |CS|}
    CS ← CS \ {C_j}
    CS ← CS ∪ {C_j1, C_j2}
return CS

Function spectra_bisection(sg)
    A ← sg.adjacency_matrix()
    D ← diag(d_i)    /* where d_i = Σ_j A_ij, i = 1, 2, ..., n */
    T ← D^{-1} A
    (λ_2, x_2) ← second_largest_eval_evec(T)
    s ← x_2
    C_1 ← {v | s[v] > 0}
    C_2 ← {v | s[v] < 0}
    return (C_1, C_2)


In Algorithm 2, the function “spectra_bisection()” is responsible for the bisection
operation of the network/subnetwork, sg. In this function, the second largest eigenvalue
$\lambda_2$ and the corresponding eigenvector $x_2$ of the transition matrix $T$ are computed first.
Then, $x_2$ is taken as the solution vector $\mathbf{s}$, and the vertices corresponding to the positive
elements and the negative elements of $\mathbf{s}$ are put into group $C_1$ and group $C_2$, respectively.
Finally, the tuple of the two groups, $(C_1, C_2)$, is returned as the result.
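
A dependency-free sketch of the bisection function is given below. It is not the authors' implementation: instead of calling an eigen-solver, it uses shifted power iteration with deflation, which is slow on graphs with a small spectral gap and is chosen here only to keep the example self-contained. It relies on the fact that $T = D^{-1}A$ is similar to the symmetric matrix $D^{-1/2} A D^{-1/2}$, whose leading eigenvector is proportional to $\sqrt{d_i}$. The function name, the iteration count, the seed, and the choice to put zero entries into group 2 are all assumptions.

```python
import math
import random


def spectral_bisection(adj, iters=2000, seed=0):
    """Split a connected graph by the signs of the second eigenvector of T = D^-1 A.

    adj maps each vertex to the set of its neighbours; the graph is assumed to be
    undirected, connected and to have at least two vertices (so D is invertible).
    """
    nodes = sorted(adj)
    root_deg = {v: math.sqrt(len(adj[v])) for v in nodes}

    # Leading eigenvector of N = D^-1/2 A D^-1/2 (eigenvalue 1), normalised.
    norm = math.sqrt(sum(r * r for r in root_deg.values()))
    top = {v: root_deg[v] / norm for v in nodes}

    rng = random.Random(seed)
    y = {v: rng.uniform(-1.0, 1.0) for v in nodes}
    for _ in range(iters):
        # Power step on N + 2I: the shift makes every eigenvalue positive, so the
        # algebraically largest eigenvalues dominate.
        y = {v: 2.0 * y[v] + sum(y[w] / (root_deg[v] * root_deg[w]) for w in adj[v])
             for v in nodes}
        # Deflation: remove the component along the leading eigenvector.
        dot = sum(y[v] * top[v] for v in nodes)
        y = {v: y[v] - dot * top[v] for v in nodes}
        length = math.sqrt(sum(val * val for val in y.values()))
        y = {v: val / length for v, val in y.items()}

    # Map back to an eigenvector of T: s = D^-1/2 y, which has the same signs as y.
    s = {v: y[v] / root_deg[v] for v in nodes}
    group1 = {v for v in nodes if s[v] > 0}
    group2 = {v for v in nodes if s[v] <= 0}  # zeros go to group 2 (assumption)
    return group1, group2
```

Because the sign of an eigenvector is arbitrary, which clique ends up in `group1` is not meaningful; only the partition itself is. For real networks, a numerical library's symmetric eigen-solver would be the more practical choice.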

As for the community number, $K$, some strategies, including some spectral
strategies [23, 28, 26], can be used to determine its value from the network automatically.
In practice, however, the numbers obtained using these strategies always differ, to a greater
or lesser extent, from the exact numbers of communities contained in the ground-truth
community structures. In fact, to our knowledge, how to determine the exact number of communities
contained in a network is still a challenging problem. Therefore, we do not attempt
to determine the community number here, but take it as a parameter of our proposed
algorithm, and give its value on each network directly in our experiments instead.

3.3. Implementation techniques 

At first glance, it seems that we need to bisect every community
in the current community structure by calling the function “spectra_bisection()” in each
iteration of the “while” loop in Algorithm 2, to select the community whose bisection
leads to a new community structure with the maximal modularity. However,
a large number of these bisections are duplicated, which lowers efficiency.

To implement the algorithm efficiently, rather than bisecting each community in the
current community structure in each iteration, we maintain a binary tree to track the
entire division procedure, which is constructed as follows.

• It begins with the vertex set V of the original network as its root;

• If the network is disconnected after sparsification, one community in the initial
community structure is taken as the left child of the root, and the other communities of the
initial community structure are taken as the right child of the root. If the right child
contains more than one community, we take it as a new root, one community in
it as its left child, and the remaining communities as its right child to construct a
subtree recursively. Figure 2(a) shows an example binary tree of such a network,
whose initial community structure contains 3 components after sparsification.

• For each community in the current community structure, after it is bisected for the
first time when selecting the community on which to perform the real bisection,
we attach the two groups to the community as its sentinel child nodes. In the
subsequent iterations, these two sentinel nodes are used directly instead of bisecting
that community again. For instance, Figure 2(b) illustrates a new version of the
binary tree shown in Figure 2(a) with sentinel nodes attached.

• For the selected community, to reflect that its bisection is accepted as the
real division, its sentinels are converted into its left child and right child, respectively.
Figure 3 shows an example. In Figure 3(a), $C_j$ is the selected community, and
$C_{j1}$ and $C_{j2}$ are the two sentinels of $C_j$, which were obtained by a previous bisection
operation. After the bisection of $C_j$ is accepted as the real division, the status of the
binary tree is as presented in Figure 3(b). In the next iteration, the only communities
that need to be bisected are $C_{j1}$ and $C_{j2}$, not all of the communities in the current
community structure. The sentinels of $C_{j1}$ and $C_{j2}$ are also plotted in Figure 3(b).

As mentioned above, the entire division procedure is tracked in this binary tree.
With its aid, each community needs to be bisected only once, and the
current community structure consists of all of the leaf nodes (not the sentinel nodes)
in each iteration. However, to locate a community in the current community structure, we
need to traverse a path from the root to the leaf node corresponding to that community,
and this traversal can be quite time consuming for large networks. To reduce the time
consumption, we number every node in the binary tree as if the tree were
a complete binary tree, i.e., the number of the root is 1, and for each
node whose number is $i$, the numbers of its left child and right child are $2i$
and $2i + 1$, respectively. Furthermore, we construct a hash table, which takes the
numbers of nodes as its keys, to map each number to the position of the corresponding
node in the tree. With the aid of the hash table, we can not only locate any community in the
tree efficiently, but also avoid traversing the binary tree when determining
whether a node whose number is $i$ is a leaf node: we simply check whether
$2i$ or $2i + 1$ is in the key set of the hash table.

A divisive spectral method for network community detection

Figure 2. An example binary tree for the initial community structure
containing 3 disconnected components, $C_1$, $C_2$, and $C_3$, after sparsification.
(a) The binary tree for the initial community structure. (b) The new version of the
binary tree with the sentinel nodes attached. The nodes plotted as squares represent
the sentinel nodes of communities; each community and its sentinel nodes are connected by
dashed lines.

Figure 3. The alteration of sentinel nodes to left child and right child of
the selected community. (a) $C_j$ is the selected community; $C_{j1}$ and $C_{j2}$ are the two
sentinel nodes of $C_j$. (b) $C_{j1}$ and $C_{j2}$ are altered to the left child and right child of $C_j$,
respectively, which means $C_j$ is removed from the current community structure and $C_{j1}$
and $C_{j2}$ are inserted into it. The sentinel nodes of $C_{j1}$ and $C_{j2}$ are also plotted.
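
The numbering scheme and the hash-table lookup can be sketched as follows. The class and method names are hypothetical, and one detail is an assumption of this sketch rather than something the text specifies: cached bisections (the sentinel nodes) are kept in a separate dictionary, so that they never appear in the key set used for the leaf test.

```python
class DivisionTree:
    """Binary division tree addressed by complete-binary-tree numbers.

    The root has number 1; the children of node i have numbers 2*i and 2*i + 1.
    """

    def __init__(self, vertices):
        self.nodes = {1: frozenset(vertices)}  # hash table: number -> vertex set
        self.sentinels = {}                    # number -> cached (left, right) split

    def is_leaf(self, i):
        # No traversal needed: a node is a leaf if neither child number is a key.
        return 2 * i not in self.nodes and 2 * i + 1 not in self.nodes

    def leaves(self):
        """Numbers of the nodes forming the current community structure."""
        return sorted(i for i in self.nodes if self.is_leaf(i))

    def cache_bisection(self, i, left, right):
        """Attach the result of bisecting node i as its sentinel nodes."""
        self.sentinels[i] = (frozenset(left), frozenset(right))

    def accept(self, i):
        """Turn the sentinels of node i into its real left and right children."""
        left, right = self.sentinels.pop(i)
        self.nodes[2 * i] = left
        self.nodes[2 * i + 1] = right
        return 2 * i, 2 * i + 1
```

With this layout, each community is bisected at most once (the result is cached until it is accepted), and both the lookup of a community and the leaf test are single dictionary operations.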

4. Experiments 
4.1. Networks 

To evaluate the effectiveness of our proposed method, we conducted extensive
experiments on 5 real-world networks, namely Zachary’s karate club network [30, 1, 2],
Lusseau’s bottlenose dolphin social network [31], a map used in the popular strategy
board game Risk [32], a collaboration network of scientists working at the Santa Fe
Institute [1], and a network representing the schedule of regular season Division I
American college football games for the 2000 season [1]. These networks are publicly
available and their ground-truth community structures are already known, which facilitates
the verification and validation of the proposed method, and their scales are small enough
to ease the interpretation and visualization of the results. Therefore, they
are widely used as benchmarks for testing community detection algorithms.
Their statistical information is listed in Table 1.

Table 1. The statistical information of the 5 networks used in our experiments.

network                          vertices   edges   communities
karate                           34         78      2
dolphin                          62         159     2
Risk map                         42         83      6
scientist’s collaboration        118        197     6
college football game schedule   115        613     12


4.2. Evaluation metrics 


To measure the strength of the extracted community structure, the modularity [2], which
is denoted as $Q$ and defined as

$$Q = \sum_{i=1}^{K} \left( e_{ii} - a_i^2 \right),$$

is a de facto standard metric at present, where $e_{ii}$ is the ratio of the edges within community $i$ to
the total edges in the network, and $a_i^2$ is the expected value of this ratio.
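
As a small worked example (an illustrative graph, not one of the networks used in this paper), consider two 4-vertex cliques joined by a single edge and partitioned into the two cliques. Here $a_i$ is taken to be the fraction of edge endpoints attached to community $i$, which is the usual reading of the modularity definition and is an assumption about the notation above. The network has $m = 13$ edges, each clique contains 6 of them, and the degrees in each clique sum to 13:

```latex
\begin{align*}
e_{11} = e_{22} &= \frac{6}{13}, \qquad a_1 = a_2 = \frac{13}{2 \cdot 13} = \frac{1}{2}, \\
Q &= \sum_{i=1}^{2} \left( e_{ii} - a_i^2 \right)
   = 2 \left( \frac{6}{13} - \frac{1}{4} \right)
   = \frac{11}{26} \approx 0.423 .
\end{align*}
```

Putting all vertices into a single community gives $e_{11} = 1$ and $a_1 = 1$, hence $Q = 0$, so the two-clique partition scores strictly higher.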

The modularity suffers from the so-called resolution limit problem [33]. Therefore,
we also use two other metrics, namely accuracy and NMI (Normalized Mutual Information)
[34], to evaluate the quality of the extracted community structure. The accuracy,
denoted as $A$, is defined as the fraction of vertices classified into the correct
communities out of the total vertices in the network. NMI is defined as

$$NMI = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|C|} n_{ij} \log \left( \frac{n_{ij}\, n}{n_i^{P}\, n_j^{C}} \right)}{\sum_{i=1}^{|P|} n_i^{P} \log \left( \frac{n_i^{P}}{n} \right) + \sum_{j=1}^{|C|} n_j^{C} \log \left( \frac{n_j^{C}}{n} \right)},$$

where $P = \{P_1, P_2, \cdots, P_{K'}\}$ and $C = \{C_1, C_2, \cdots, C_K\}$ are the extracted community
structure and the ground-truth community structure, respectively, $n_i^{P} = |P_i|$, $n_j^{C} = |C_j|$,
and $n_{ij} = |P_i \cap C_j|$.

Both the accuracy and NMI take the ground-truth community structure as a
baseline to measure how closely the extracted community structure approaches the ground
truth, and thus measure the ability of the community detection methods or algorithms.
Both fall in the range [0,1], and larger values are better.
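
The NMI computation can be sketched as below. This is a minimal illustration that assumes the standard form of the measure (numerator $-2\sum_{ij} n_{ij}\log(n_{ij}n / (n_i^{P} n_j^{C}))$, denominator the sum of the two entropy-like terms); partitions are assumed to be lists of non-empty, disjoint vertex sets covering the same vertices, and the function name `nmi` and the value returned when both partitions consist of one community are assumptions.

```python
import math


def nmi(extracted, truth):
    """Normalized mutual information between two partitions (lists of sets)."""
    n = sum(len(p) for p in extracted)  # total number of vertices
    numerator = 0.0
    for p in extracted:
        for c in truth:
            n_ij = len(p & c)
            if n_ij:  # terms with n_ij = 0 contribute nothing
                numerator += n_ij * math.log(n_ij * n / (len(p) * len(c)))
    denominator = (sum(len(p) * math.log(len(p) / n) for p in extracted)
                   + sum(len(c) * math.log(len(c) / n) for c in truth))
    if denominator == 0:
        return 1.0  # both partitions are a single community (assumed convention)
    return -2.0 * numerator / denominator
```

Two identical partitions give a value of 1, and two partitions that carry no information about each other, such as `[{0, 1}, {2, 3}]` against `[{0, 2}, {1, 3}]`, give 0.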
4.3. Comparison system and parameter settings

Apart from being a bisection spectral method, our proposal falls in the category
of divisive hierarchical methods as well. Therefore, to demonstrate the superiority of
our proposal, we ran the proposed method on the 5 networks and compared the
results not only with two spectral-analysis based algorithms, namely the standard
spectral clustering algorithm [16] and the modularity-matrix based bisection spectral
algorithm proposed by Newman [24] (abbreviated as Newman2006), but also with a
novel hierarchical algorithm, Infohiermap [35], which identifies hierarchical community
structures from networks by finding the shortest multilevel description of a random
walk in networks. The results of the spectral clustering algorithm are not deterministic,
because it exploits the k-means algorithm to cluster the vertices, so we present the result
that occurred most frequently in 20 runs of the algorithm.

In addition, for the 2 two-community networks, we also made a comparison between
the results of our proposed method and Newman’s method described in Ref. [22]
(abbreviated as Newman2013), as Newman2013 can only be applied to two-community
networks. Furthermore, on all 5 networks, we compared the results of our proposal
with the results extracted by the proposed repeated bisection spectral algorithm alone,
without network sparsification, to demonstrate the effectiveness of the proposed network
sparsification algorithm. Hereafter, we refer to the proposed method with network
sparsification as the complete version of our proposal (Algorithm 1 + Algorithm 2) and
to the proposed repeated bisection spectral algorithm without network sparsification
(Algorithm 2 only) as the lite version of our proposal.

For the proposed method, the similarity threshold θ in Algorithm 1 controls
the number of edges to be removed from the network. Its setting is crucial for the
method. Too large a θ filters out too many edges from the network, which may even
destroy the skeleton of the communities and cause the method to fail to identify
them; conversely, too small a θ removes few of the edges between communities, so
that the boundaries between communities will not be as clear as expected after
sparsification. That is to say, the sparsification algorithm may have little effect
if θ is too small. We carried out the experiments on each network with θ taking
values in [0, 0.6] in increments of 0.05, and concluded that θ = 0.15 seems to be the
best setting for all 5 networks. For other networks, we suggest empirically that the
mode of the similarity values in [0.1, 0.2] be taken as the value of θ.
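
The role of the similarity threshold (written `theta` in the code) can be illustrated with a minimal sketch. The edge-similarity measure of Algorithm 1 is defined earlier in the paper and is not reproduced here; the Jaccard index of closed neighbourhoods is used purely as a stand-in assumption, and `sparsify` and `sweep` are hypothetical names.

```python
def jaccard(adj, u, v):
    """Stand-in similarity (an assumption, not the paper's measure):
    Jaccard index of the closed neighbourhoods of u and v."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)

def sparsify(edges, theta):
    """Keep only the edges whose endpoint similarity is at least theta."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return [(u, v) for u, v in edges if jaccard(adj, u, v) >= theta]

def sweep(edges, start=0.0, stop=0.6, step=0.05):
    """Number of surviving edges for each theta in [start, stop]."""
    n = int(round((stop - start) / step))
    thetas = [round(start + i * step, 2) for i in range(n + 1)]
    return {t: len(sparsify(edges, t)) for t in thetas}

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
kept = sparsify(edges, 0.5)
survivors = sweep(edges)
```

On this toy graph the bridge has the lowest similarity, so a moderate threshold removes it and leaves both triangles intact, while a threshold above the intra-triangle similarities would start to remove edges inside the communities.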

The parameter K in Algorithm 2 specifies the number of communities in the
resulting community structure. On each network, its value is naturally set to the
corresponding number listed in the last column of Table 1.

4.4. Experimental results 

Zachary’s karate club network. This network contains 34 vertices and 78 edges, in
which vertices represent members of a karate club and edges represent social interactions
between members observed inside or outside the club. Later, the club
split into two factions because of a dispute between the administrator and the instructor.
Corresponding to the two factions, the network contains two communities, whose structure
is shown in Figure 4(a). Feeding this network into the comparison algorithms and our
proposed method, we obtained the results shown in Figures 4(b)-4(g), respectively. The
values of the three metrics obtained on this network are listed in Table 2.



Figure 4. Zachary’s karate club network. (a) The ground-truth community
structure; (b) The community structure identified by the spectral clustering algorithm;
(c) The community structure detected by Newman2006; (d) The community structure
revealed by Infohiermap; (e) The community structure found by Newman2013; (f) The
community structure extracted by the lite version of our proposal; (g) The community
structure extracted by the complete version of our proposed method.


On this network, the spectral clustering algorithm assigned one vertex to the
incorrect community. Although Newman2006 originates from modularity optimization,
the modularity of its result is smaller than that of Infohiermap, which is the highest
on this network; however, the community structures of both deviate far from the
ground truth. Newman2013 bisected the network into two communities, also
with one vertex misclassified. The lite version of our proposal obtained
the same result as Newman2013, which is not a coincidence, because the matrices
underlying both are derived from Eq. 1. In contrast, the result extracted by the
complete version of our proposed method is identical to the ground-truth community
structure, i.e., our proposal obtained the best result on this network.
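
The modularity values compared above can exceed that of the ground truth because modularity is computed from the topology alone. A minimal sketch follows, assuming that Q denotes the standard Newman-Girvan modularity of an undirected, unweighted graph, written in the per-community form Q = sum_c [ l_c / m - (d_c / 2m)^2 ]; the function name and the toy graph are illustrative.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition of an undirected,
    unweighted graph.

    edges: list of (u, v) pairs; community: dict vertex -> label.
    l_c is the number of edges inside community c, d_c the total
    degree of its vertices, and m the number of edges.
    """
    m = len(edges)
    intra, degree = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Toy graph: two triangles joined by one bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, split)
```

For the two-triangle split each community has 3 internal edges and total degree 7, so Q = 2 (3/7 - 1/4) = 5/14; placing all vertices in one community gives Q = 0.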


Figure 5. The change of values of the second eigenvector elements on
Zachary’s karate club network. The left panel shows the case of the original
network without sparsification, and the right panel shows the case of the
network after sparsification.

Furthermore, the values of the second eigenvector elements on this
network before and after sparsification are illustrated in Figure 5. The gap
between the positive elements and the negative elements of the second eigenvector is
clearly much larger after sparsification than before,
which means that the boundary between the two communities becomes clearer
and sharper because of the sparsification. This demonstrates, to some extent, the
effectiveness of the proposed network sparsification algorithm.
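
The gap between positive and negative eigenvector elements can be quantified with a short sketch. It assumes the second eigenvector has already been computed and that a bisection assigns vertices by the sign of their element; the helper names and the two example vectors are hypothetical and are not read from Figure 5.

```python
def sign_bisection(vector):
    """Split vertex indices by the sign of their eigenvector element."""
    positive = [i for i, x in enumerate(vector) if x >= 0]
    negative = [i for i, x in enumerate(vector) if x < 0]
    return positive, negative

def sign_gap(vector):
    """Distance between the smallest non-negative element and the
    largest negative element; 0.0 if one side is empty."""
    pos = [x for x in vector if x >= 0]
    neg = [x for x in vector if x < 0]
    if not pos or not neg:
        return 0.0
    return min(pos) - max(neg)

# Hypothetical eigenvector elements before and after sparsification.
before = [0.21, 0.05, 0.02, -0.03, -0.08, -0.25]
after = [0.24, 0.19, 0.17, -0.16, -0.20, -0.26]
gap_before, gap_after = sign_gap(before), sign_gap(after)
```

A larger gap means that fewer elements lie close to zero, so small perturbations are less likely to flip a vertex from one side of the bisection to the other.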

Lusseau’s bottlenose dolphin social network. This network consists of 62
vertices and 159 edges, in which vertices represent bottlenose dolphins living in Doubtful
Sound, New Zealand. An edge between two dolphins indicates that they were observed
together more often than expected by chance.

The ground-truth community structure of this network is illustrated in Figure 6(a)^.
Figures 6(b)-6(g) show the community structures extracted by the comparison
algorithms and the proposed method, respectively, and the values of the three metrics
obtained on this network are also listed in Table 2.

On this network, the spectral clustering algorithm obtained the result closest to the
ground-truth community structure, with only one vertex misclassified.
Newman2006 and Infohiermap both extracted more than two communities from
this network, and both results differ greatly from the ground truth^^. Newman2013,
the lite version, and the complete version of our proposed method all identified
the same community structure, in which two vertices were wrongly
assigned to the opposite community. It may seem that the proposed network sparsification
algorithm failed to sparsify the network, since the complete version of our proposal
did not obtain a better result than the version without sparsification. But that is not
the case. To demonstrate this, as in Figure 5, we plotted the
values of the second eigenvector elements on this network before and
after sparsification in Figure 7. Evidently, the gap between the positive elements and the
negative elements is again much larger after sparsification than before,
which means that our proposed network sparsification algorithm does take effect on
this network.

^ Although this network can also be considered to contain 4 communities, we treat it as a
two-community network in this paper, as in Ref. [22].

^^ In addition, they also depart far from the four-community ground-truth structure of this network.

Figure 6. Lusseau’s bottlenose dolphin social network. (a) The ground-
truth community structure; (b) The community structure detected by the spectral
clustering algorithm; (c) The community structure extracted by Newman2006; (d)
The community structure identified by Infohiermap; (e) The community structure
revealed by Newman2013; (f) The community structure uncovered by the lite version
of the proposal; (g) The community structure detected by the complete version of our
proposal.

Figure 7. The change of values of the second eigenvector elements on
Lusseau’s bottlenose dolphin social network. The left panel shows the case
of the original network without sparsification, and the right panel presents the case
after sparsification.

Risk map network. This network is the map of the popular strategy board game
Risk^. It is a political map of the Earth, divided into 42 territories, which are grouped
into 6 continents. Accordingly, the network comprises 42 vertices and 83 edges.
The ground-truth community structure, which corresponds naturally to the 6 continents,
is shown in Figure 8(a). Running the comparison algorithms and
the proposed method on this network, we obtained the results illustrated in Figures
8(b)-8(f), respectively; the values of the three metrics achieved on this network are
listed in Table 2 as well.



Figure 8. Risk map network. (a) The ground-truth community structure; (b)
The community structure detected by the spectral clustering algorithm; (c) The
community structure found by Newman2006; (d) The community structure revealed by
Infohiermap; (e) The community structure identified by the lite version of the proposal;
(f) The community structure detected by the complete version of our proposed method.

^ https://en.Wikipedia.org/?title=Risk_(game)

In this network, vertices “26”, “12”, “16”, and “33” are special. Taking vertex
“26” as an example, it has 6 incident edges, which connect it to 3
different communities with 2 edges per community. It is hard to determine which
community this vertex should belong to from the topological information alone.
Similar situations occur for the other special vertices. Without considering the physical
meaning of the vertices, it is reasonable to assign each of them
to any community it is connected to. For this reason, community detection algorithms
tend to make mistakes around these vertices.
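
The ambiguity described above can be made concrete by counting, for each vertex, how many of its edges lead to each community. The sketch below flags vertices whose largest count is shared by two or more communities; the function name and the toy graph are hypothetical and do not reproduce the Risk map.

```python
from collections import Counter

def ambiguous_vertices(edges, community):
    """Return vertices whose edges are spread evenly over two or more
    communities, so that topology alone cannot decide their membership."""
    tallies = {}
    for u, v in edges:
        tallies.setdefault(u, Counter())[community[v]] += 1
        tallies.setdefault(v, Counter())[community[u]] += 1
    result = []
    for vertex, tally in tallies.items():
        top = max(tally.values())
        if sum(1 for c in tally.values() if c == top) > 1:
            result.append(vertex)
    return sorted(result)

# Toy graph: three triangles (communities a, b, c) and a vertex "x"
# with 6 edges, 2 into each triangle, mirroring the 2-2-2 pattern.
edges, community = [], {"x": "a"}
for c in "abc":
    n1, n2, n3 = c + "1", c + "2", c + "3"
    edges += [(n1, n2), (n1, n3), (n2, n3), ("x", n1), ("x", n2)]
    community.update({n1: c, n2: c, n3: c})
flagged = ambiguous_vertices(edges, community)
```

Only "x" is flagged here, since every other vertex has a strict majority of its edges inside its own triangle.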

The results of the spectral clustering algorithm, of Newman2006, and of the
lite and complete versions of our proposal all contain misclassifications of one
or more of these special vertices. The exception is Infohiermap, which remarkably
classified all of these special vertices correctly. However, it split the community located
at the top right of the panel into two, resulting in a lower accuracy. For our proposal,
network sparsification eliminated most of the mistakes, and the method extracted the
community structure with a high degree of success: all but one of the territories are
grouped correctly with the other territories of their continent. This community structure
is the best among those of all the algorithms.

Scientist’s collaboration network. This network depicts the coauthorship relationships
between 118 scientists working at the Santa Fe Institute, in which each vertex represents
a scientist, and each edge connects two scientists who have coauthored at least one
article. It contains 118 vertices and 197 edges, and can be naturally partitioned into
6 communities according to the scientists’ specialities. The ground-truth community
structure and the results extracted by the comparison algorithms and the proposed
method are visualized in Figures 9(a)-9(g), respectively, and the values of the three metrics
are also listed in Table 2.

On this network, the spectral clustering algorithm merged two communities (plotted
as cyan pentagons and purple circles in Figure 9(a), respectively) into one, but split one
community (plotted as light blue heptagons in Figure 9(a)) into two. Besides this,
10 vertices (vertices “33”, “39”, “40”, “41”, “102”, “103”, “104”, “106”, “107”, and
“108”) were assigned to incorrect communities, i.e., the quality of the resulting
community structure is not high. For Newman2006, the quality of the result is also
quite poor: several of the extracted vertex groups are too trivial to be accepted as
communities. Infohiermap revealed two levels of community structure in this network. The
first level contains only 3 communities, and the second level consists of as many as 16
communities; both deviate far from the ground-truth community structure.
In the result of the lite version of the proposed method, vertices “27”, “28”, “29”,
“102”, “103”, “104”, “106”, “108”, and “109” were misclassified. After sparsification,
the mistakes on the first three of these vertices were eliminated by the complete
version of our proposal. Unfortunately, 6 vertices are still
assigned to the incorrect community in the final community structure. Even so,
the resulting structure of our proposed method is the one closest to the ground-truth
community structure, which means that, compared with the other algorithms, our proposed
method extracted the best community structure from this network.

College football game schedule network. This network is the schedule of
regular-season Division I American college football games for the 2000 season. It is
made up of 115 vertices and 613 edges, in which vertices represent teams and edges
represent regular-season games between the two teams they connect. The teams are
divided into 12 “conferences”, and games are more frequent between teams of the same
conference than between teams of different conferences. Therefore, each conference is a
natural community, and the ground-truth community structure is accordingly as shown
in Figure 10(a). Applying the comparison algorithms and the proposed method to this
network, we obtained the community structures presented in Figures 10(b)-10(f),
respectively, and the values of the three metrics obtained on this network are
listed in Table 2 as well.

Figure 9. Scientist’s collaboration network. (a) The ground-truth community
structure; (b) The community structure identified by the spectral clustering algorithm;
(c) The community structure revealed by Newman2006; (d) The first-level community
structure extracted by Infohiermap; (e) The second-level community structure
extracted by Infohiermap; (f) The community structure found by the lite version of
the proposal; (g) The community structure detected by the complete version of the
proposed method.

On this network, the spectral clustering algorithm tended to merge two or more
communities into one, and to separate a small portion of vertices from some communities
to form additional communities (not only in the result presented here, but also in the other
results of the 20 runs of the algorithm on this network). For Newman2006, the quality
of the result is quite poor, as many vertices were assigned to incorrect communities.
A similar result occurred for the lite version of the proposal, with too
many misclassified vertices. After sparsification, all mistakes were eliminated:
the result of the complete version of our proposed method is identical to the ground
truth. For Infohiermap, the extracted structure is almost identical to the ground-truth
community structure, except for one misclassified vertex. These results
demonstrate that our proposed method again performs the best on this network.

Figure 10. College football game schedule network. (a) The ground-
truth community structure; (b) The community structure extracted by the spectral
clustering algorithm; (c) The community structure identified by Newman2006; (d)
The community structure uncovered by Infohiermap; (e) The community structure
revealed by the lite version of the proposal; (f) The community structure detected by
the complete version of the proposed method.

Finally, let us analyze the values of the three evaluation metrics, which
were recorded in Table 2 during the experiments. In terms of
modularity, Infohiermap achieved the largest value 3 times (on the karate club
network, the dolphin social network and the Risk map network), and the
complete version of our proposal achieved it twice (on the scientist’s collaboration network
and the football game schedule network). None of the other algorithms obtained
the largest value on any of the 5 networks. In terms
of accuracy and NMI, our proposed method ranked second once (on the dolphin social
network), by a very small margin behind the spectral clustering algorithm, and
obtained the largest value on all 4 other networks. Considering the
meaning of accuracy and NMI, these results suggest that the community structure
extracted by our proposed method is the closest to the ground-truth community structure.
Furthermore, we attached a rank (the number in parentheses) to each value
of each metric per network, and calculated a score for each algorithm or method
by averaging its rank numbers. The final rank of every
algorithm is listed in the last column of Table 2, which confirms that our proposed
method performs much better than the comparison algorithms.

Table 2. Comparison of the 3 metrics. We report the rank of each algorithm
(in parentheses) on each metric per network; each score value in the last but one
column is the average of the algorithm's ranks on the three metrics. The highest rank and
the corresponding algorithm or method are shown in bold.

network        algorithm            Q         A         NMI       score  rank
karate         ground truth         0.371     1.00      1.00
               spectral clustering  0.313(6)  0.912(4)  0.646(6)  4.667  6
               Newman2006           0.393(2)  0.618(6)  0.677(5)  4.333  5
               Infohiermap          0.402(1)  0.824(5)  0.699(4)  3.333  4
               Newman2013           0.360(4)  0.971(2)  0.836(2)  2.667  2
               lite                 0.360(4)  0.971(2)  0.836(2)  2.667  2
               proposal             0.371(3)  1.00(1)   1.00(1)   1.667  1
dolphin        ground truth         0.373     1.00      1.00
               spectral clustering  0.379(6)  0.984(1)  0.889(1)  2.667  5
               Newman2006           0.491(2)  0.484(6)  0.449(6)  4.667  6
               Infohiermap          0.525(1)  0.581(5)  0.566(5)  3.667  4
               Newman2013           0.385(3)  0.968(2)  0.814(2)  1.667  1
               lite                 0.385(3)  0.968(2)  0.814(2)  1.667  1
               proposal             0.385(3)  0.968(2)  0.814(2)  1.667  1
Risk map       ground truth         0.621     1.00      1.00
               spectral clustering  0.589(3)  0.833(3)  0.818(3)  3.000  3
               Newman2006           0.547(5)  0.762(4)  0.723(4)  4.333  4
               Infohiermap          0.634(1)  0.857(2)  0.945(2)  1.667  2
               lite                 0.554(4)  0.643(5)  0.705(5)  4.667  5
               proposal             0.631(2)  0.976(1)  0.956(1)  1.333  1
collaboration  ground truth         0.739     1.00      1.00
               spectral clustering  0.695(5)  0.703(4)  0.772(5)  4.667  4
               Newman2006           0.708(3)  0.831(3)  0.834(3)  3.000  3
               Infohiermap (1st)    0.651(6)  0.636(5)  0.764(6)  5.667  6
               Infohiermap (2nd)    0.704(4)  0.602(6)  0.805(4)  4.667  4
               lite                 0.734(2)  0.924(2)  0.895(2)  2.000  2
               proposal             0.740(1)  0.949(1)  0.936(1)  1.000  1
football       ground truth         0.601     1.00      1.00
               spectral clustering  0.538(3)  0.791(4)  0.908(3)  3.333  3
               Newman2006           0.493(5)  0.652(5)  0.758(5)  5.000  5
               Infohiermap          0.600(2)  0.991(2)  0.989(2)  2.000  2
               lite                 0.503(4)  0.809(3)  0.811(4)  3.667  4
               proposal             0.601(1)  1.000(1)  1.000(1)  1.000  1

lite: the proposed repeated bisection algorithm without network sparsification; proposal: the
complete version of our proposed method. Infohiermap (1st), Infohiermap (2nd): the first-level and
the second-level community structures extracted by Infohiermap, respectively.
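
The rank-averaging procedure can be sketched on the Risk map rows of Table 2. The tie rule is an assumption: equal values share the best rank and the following rank is skipped, as the tied entries in Table 2 suggest. The function names are illustrative.

```python
def competition_ranks(values):
    """Rank values in descending order; ties share the best rank."""
    return {name: 1 + sum(other > v for other in values.values())
            for name, v in values.items()}

def score_table(metrics):
    """metrics: dict metric -> dict algorithm -> value.
    Returns dict algorithm -> (average rank, final rank)."""
    per_metric = [competition_ranks(vals) for vals in metrics.values()]
    algorithms = list(per_metric[0])
    score = {a: sum(r[a] for r in per_metric) / len(per_metric)
             for a in algorithms}
    # A lower average rank is better, so the final rank counts the
    # algorithms with a strictly smaller score.
    final = {a: 1 + sum(s < score[a] for s in score.values())
             for a in algorithms}
    return {a: (round(score[a], 3), final[a]) for a in algorithms}

# Q, A and NMI values of the Risk map network from Table 2.
risk = {
    "Q": {"spectral clustering": 0.589, "Newman2006": 0.547,
          "Infohiermap": 0.634, "lite": 0.554, "proposal": 0.631},
    "A": {"spectral clustering": 0.833, "Newman2006": 0.762,
          "Infohiermap": 0.857, "lite": 0.643, "proposal": 0.976},
    "NMI": {"spectral clustering": 0.818, "Newman2006": 0.723,
            "Infohiermap": 0.945, "lite": 0.705, "proposal": 0.956},
}
table = score_table(risk)
```

For these rows the sketch reproduces the score and rank columns of the Risk map block, for example an average rank of 1.333 and a final rank of 1 for the complete version of the proposal.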

5. Conclusion and future work 

In this paper, we proposed a novel spectral method to identify community structures
in networks, which combines a network-sparsification algorithm with
a repeated bisection spectral community detection algorithm. First, some
inter-community edges are removed by the sparsification algorithm to make the community
structure more prominent; then the repeated bisection spectral algorithm extracts the
community structure accurately from the sparsified network. We have conducted
extensive experiments on 5 real-world networks, and the experimental results show that
our proposed method is significantly superior to the comparison algorithms.

The network sparsification algorithm is of great importance to our proposed
method. Admittedly, the strategy employed in this paper to remove edges from the
network is rather naive. The similarity threshold θ is in fact a global parameter, so
the sparsification algorithm decides whether to remove an edge from the global
perspective of the entire network, without considering any local property of the end
vertices of the edge. Hence, some inter-community edges located in regions
where connections are relatively dense will not be removed, which might affect the
quality of the result. This might be the reason why 6 vertices are still
misclassified in the community structure extracted by our proposed method
from the scientist’s collaboration network.

Therefore, although the network sparsification algorithm proposed in this paper
does take effect, we think that a more sophisticated network sparsification strategy
exploiting the local properties of edges, e.g., the densities of end vertices, will perform
better. Network sparsification might be a future research direction not only
for community detection, but also for efficiency, considering
the ever-growing scale of networks.